Xiaorui (Jeremy) Zhu
02/10/2024
“Unless the data is something I’ve analyzed a lot before, I usually feel like the blind men and the elephant.” – Jeff Leek
Data preparation accounts for about 80% of the work of data scientists
In business analytics, data comes from various sources such as databases, spreadsheets, and external sources. These sources often contain missing values, duplicate entries, outliers, and formatting inconsistencies, which can strongly skew the analysis results and lead to incorrect business decisions if not addressed properly.
Data cleaning is a crucial step in the process of preparing data for analysis in the field of business analytics. It involves identifying and correcting errors, inconsistencies, and inaccuracies in the dataset to ensure its reliability and quality.
It is important and time-consuming,
but, do you enjoy data cleaning?
The following two datasets are the raw and clean data. The Raw dataset contains the headers and other information that triggers missing values. Without cleaning, it cannot be directly used. On the contrary, the clean dataset presented right after has the rows and colomns clearly stroed.
# install.packages("readxl")
Retail <- readxl::read_xlsx(path = "data/RetailSales2018.xlsx")
head(Retail)## # A tibble: 6 × 5
## RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA> <NA> <NA> <NA>
## 2 CPI Table 24(http://www.bls.gov/cpi) <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 Year Month Sales S_Fa… CPI
## 6 1992 JAN 1471… 0.89… 138.1
## # ℹ abbreviated name:
## # ¹`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
## # A tibble: 318 × 5
## Year Month Sales S_Factor CPI
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1992 JAN 147182. 0.897 138.
## 2 1992 FEB 147013. 0.895 139.
## 3 1992 MAR 159653. 0.975 139.
## 4 1992 APR 163606. 0.993 140.
## 5 1992 MAY 170089. 1.03 140.
## 6 1992 JUN 168922. 1.02 140.
## 7 1992 JUL 169982. 1.02 140.
## 8 1992 AUG 170650. 1.02 141.
## 9 1992 SEP 164664. 0.972 141.
## 10 1992 OCT 173583. 1.02 142.
## # ℹ 308 more rows
Data sources: Retail Sales Raw Data and Retail Sales Clean Data
In R programming language, data wrangling skills are essential for efficiently manipulating and preparing data for analysis. Some important data wrangling skills in R include:
By performing thorough data cleaning, business analysts can improve the quality and reliability of their analyses, leading to more accurate insights and better-informed business decisions. It lays the foundation for successful data analysis and ensures that businesses can trust the results derived from their data-driven initiatives.
R supports the importing from various sources such as CSV files, Excel spreadsheets, databases (e.g., MySQL, PostgreSQL), web APIs, and other formats.
| Option | Function | Speed in seconds |
|---|---|---|
| Base R | read.csv("data/flights.csv") |
~ 4 seconds |
library(readr) |
read_csv("data/flights.csv") |
~ 0.8 seconds |
library(data.table) |
fread("data/flights.csv") |
~ 0.3 seconds |
library(readxl) |
read_excel() |
|
Database: library(DBI) |
dbSendQuery() |
|
Web APIs: library(httr) |
GET(url = NULL,...) |
read.csv() and read_excel()We focus on read.csv() function to read in .csv files,
and read_excel() to read in .xlsx files.
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## # A tibble: 6 × 5
## RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr>
## 1 SEASONAL FACTORS(http://www.census.gov/retail/marts/w… <NA> <NA> <NA> <NA>
## 2 CPI Table 24(http://www.bls.gov/cpi) <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 Year Month Sales S_Fa… CPI
## 6 1992 JAN 1471… 0.89… 138.1
## # ℹ abbreviated name:
## # ¹`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
Skills to clean and preprocess data, which may involve handling
missing values (NA), removing duplicates, converting data
types, dealing with outliers, and detecting and correcting errors in the
dataset using functions like na.omit(),
complete.cases(), duplicated(), and various
functions from the dplyr and tidyr
packages.
## # A tibble: 6 × 5
## RETAIL & FOOD SERVICES(http://www.census.gov/retail/…¹ ...2 ...3 ...4 ...5
## <chr> <chr> <chr> <chr> <chr>
## 1 Year Month Sales S_Fa… CPI
## 2 1992 JAN 1471… 0.89… 138.1
## 3 1992 FEB 1470… 0.89… 138.6
## 4 1992 MAR 1596… 0.97… 139.…
## 5 1992 APR 1636… 0.99… 139.5
## 6 1992 MAY 1700… 1.02… 139.…
## # ℹ abbreviated name:
## # ¹`RETAIL & FOOD SERVICES(http://www.census.gov/retail/marts/www/timeseries.html)`
# Change colomn names
names(Retail2) <- as.character(Retail2[1, ])
Retail2 <- Retail2[-1, ]
Retail2[1:3,]## # A tibble: 3 × 5
## Year Month Sales S_Factor CPI
## <chr> <chr> <chr> <chr> <chr>
## 1 1992 JAN 147182.451 0.89700000000000002 138.1
## 2 1992 FEB 147012.70000000001 0.89500000000000002 138.6
## 3 1992 MAR 159653.32499999998 0.97499999999999998 139.30000000000001
## [1] 318 5
## [1] 318 5
Ability to transform data by creating new variables, reshaping data
from wide to long format (and vice versa), and summarizing data using
functions like mutate(), select(),
filter(), arrange(), group_by(),
summarize() from the dplyr package, and
functions like pivot_longer() and
pivot_wider() from the tidyr package.
Skills to perform various data manipulation tasks such as
merging/joining datasets, splitting data into subsets, and reshaping
data using functions like merge(), join(),
split(), reshape(), rbind(),
cbind().
For quick data exploration, base R plotting functions can provide an
expeditious and straightforward approach to understanding your data.
Data visualization skills are the abilities to visualize data to explore
patterns, relationships, and trends using functions like
ggplot2 for creating sophisticated and customizable plots,
plot() for basic plots, and ggplotly() from
the plotly package for interactive plots.
Skills to work with date and time data, including parsing,
formatting, extracting components (e.g., year, month, day), and
performing date arithmetic using functions like as.Date(),
as.POSIXct(), strftime(),
strptime(), lubridate package.
Ability to manipulate and process character strings, including
pattern matching, substring extraction, and string manipulation using
functions like grep(), sub(),
gsub(), strsplit(), and stringr
package.
Understanding of control structures like loops (for,
while) and conditional statements (if,
else, ifelse) to automate repetitive tasks and
apply conditional operations on data.
Skills to handle errors and exceptions that may occur during data
wrangling operations, including debugging techniques and using functions
like tryCatch().
Familiarity with efficient coding practices such as vectorization,
using R’s apply family of functions (apply(),
lapply(), sapply(), vapply(),
mapply()), and leveraging the capabilities of packages like
dplyr and tidyr for faster and more concise
code.
These skills are crucial for effectively managing and preparing data for analysis and visualization tasks in R. Continuous practice and exploration of R packages and functions related to data wrangling will enhance proficiency in handling diverse datasets and extracting meaningful insights.